Skip to content

Refactor Dataloading#13

Draft
bastitx wants to merge 8 commits intomainfrom
refactor-dataloading
Draft

Refactor Dataloading#13
bastitx wants to merge 8 commits intomainfrom
refactor-dataloading

Conversation

@bastitx
Copy link

@bastitx bastitx commented Aug 26, 2025

PR Checklist

  • Use descriptive commit messages.
  • Provide tests for your changes.
  • Update any related documentation and include any relevant screenshots.
  • Check if changes need to be made to docs (README or any guides in /docs/).
  • Reflect the changes you made in the changelog.

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Optimization
  • Documentation Update

Description

Move the data loading out of the BaseTask into a Dataloader class. This would make it possible to add additional Dataloader classes to load datasets from other sources.

Related Tickets & Documents

  • Related Issue #
  • Closes #

QA Instructions, Screenshots, Recordings

Please replace this line with instructions on how to test your changes, a note
on the hardware and config this has been tested on, as well as any relevant
additional information.

Added/updated tests?

  • Yes
  • No, and this is why: please replace this line with details on why tests
    have not been included
  • I need help with writing tests

[optional] Are there any post deployment tasks we need to perform?

@bastitx bastitx force-pushed the refactor-dataloading branch 2 times, most recently from 213289d to e942df4 Compare August 26, 2025 09:34
@bastitx bastitx force-pushed the refactor-dataloading branch from e942df4 to f652d81 Compare August 26, 2025 09:37
@bastitx bastitx force-pushed the refactor-dataloading branch from e3af211 to 9aae64d Compare August 27, 2025 20:53
@bastitx bastitx marked this pull request as ready for review August 28, 2025 07:23
@bastitx bastitx requested a review from a team August 28, 2025 07:23
@bastitx bastitx marked this pull request as draft August 28, 2025 08:43
self.result_processor = result_processor
self.num_samples = config.num_samples
self.save_intermediate_results = config.save_intermediate_results
self.dataloader = HFDataloader()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the idea be that we're keeping this hard coded atm and only once new dataloaders would be needed we refactor such that you can inject a dataloader into the ResponseGenerator?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering about the requirements: is it that a task can be loaded from HF as well as from some other source by using the same path (just by switching a dataloader) or that certain tasks are to be loaded from HF and some other tasks from some other source?

@@ -224,7 +225,7 @@ def _check_no_duplicate_names(cls) -> None:

def make_sure_all_hf_datasets_are_in_cache() -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was not part of this PR, but I realized this function is only used in the ci. Maybe a comment about that would be adequate. wdyt?

@FelixReinfurtAA
Copy link
Contributor

I really like the abstraction of the dataloader. So far lgtm overall.
Do you plan to incorporate another dataloader as well?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants